MADOBIS: Aplicaciones y Discusiones en Desarrollo Animal
Set your working directory and load the libraries we need.
# set working directory
setwd("~/Documents/git_projects/ADDA_taller2_2022/")
# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")
# use pacman to load libraries
pacman::p_load(tidyverse,gprofiler2)
From the previous exercises on differential gene expression, we looked at some comparisons that resulted in >300 genes that were over or under expressed. It is not trivial to make sense of that many genes. One way to try to distill this large list of genes is a functional enrichment analysis. Essentially, we want to use curated lists of genes that belong to a specific functional category. Perhaps the most widely used knowledge base of this kind, is the Gene Ontology, but there are others, such as the [Kyoto Encyclopedia of Genes and Genomes] (https://www.genome.jp/kegg/) and the Reactome.
I will assume that you are familiar with the general concepts of functional enrichment analyses and the functional databases. In brief, the aim is to identify whether genes that belong to sets of annotated genes (e.g. a GO term) are over-represented (i.e. enriched) in a set of genes of interest (e.g. a list of differentially expressed genes). Often, this is also called a Over Representation Analysis.
To perform such as test, we therefore need:
Lets get to it!
Let's load the results files we need.
# the list of DEG results from the previous exercises
res<-readRDS("./results/deseq2_results.rds")
# the annotations
xtrop<-read_csv("./data/PCU23_annotations_xtr105.csv")
Ideally, the annotations of your genes should come from experimental evidence from your organism. If you work with mice, drosophila or humans for example, many functional enrichment analysis tools will be very easy to implement because they will automatically connect your lists of genes to annotations, background lists etc. This is unlikely going to exist if you work with non-model systems. We are therefore dependent on making our own annotations and lists.
For our annotations, we are using BLAST results from querying our genes against the proteome of Xenopus tropicalis. This is a well studied frog species. It is still only distantly related to our focal species, but this is the best we can do. As we saw in the previous exercise, this results in many genes not having any annotations. These will be excluded unfortuantely.
First we will use the same bit of code from the previous exercise to make a list of DEGs per comparison.
extract_degs<-function(x) {
return(
x %>%
as_tibble(rownames = "gene") %>%
filter(padj<0.05) %>%
pull(gene)
)
}
# now extract all
sig_deg<-lapply(res, FUN=extract_degs)
## Warning: package 'BiocGenerics' was built under R version 4.0.5
## Warning: package 'GenomeInfoDb' was built under R version 4.0.5
str(sig_deg)
## List of 4
## $ bD_bV: chr [1:395] "PECUL23A000242" "PECUL23A000289" "PECUL23A000301" "PECUL23A000483" ...
## $ wD_wV: chr [1:370] "PECUL23A000301" "PECUL23A000532" "PECUL23A000547" "PECUL23A000601" ...
## $ bV_wV: chr [1:98] "PECUL23A000301" "PECUL23A000532" "PECUL23A002321" "PECUL23A004402" ...
## $ bD_wD: chr [1:73] "PECUL23A000532" "PECUL23A000872" "PECUL23A002033" "PECUL23A002231" ...
We now have to convert those to Xenopus tropicalis IDs, by extracting the matching annotations. Because we will do this mutliple times over list itmes, it is cleaner to write a function first, then apply it to the lis.
# make a function that extracts matching X. tropicalis IDs
extract_xtr<-function(x) {
return(
xtrop %>%
filter(gene_id %in% x) %>%
pull(xtr_pep_id_x) %>%
unique()
)
}
# apply function to list of Pelobates IDs
xtr_deg<-lapply(sig_deg, FUN=extract_xtr)
str(xtr_deg)
## List of 4
## $ bD_bV: chr [1:434] "ENSXETP00000052139" "ENSXETP00000083555" "ENSXETP00000092919" "ENSXETP00000049789" ...
## $ wD_wV: chr [1:402] "ENSXETP00000052139" "ENSXETP00000083555" "ENSXETP00000093342" "ENSXETP00000029853" ...
## $ bV_wV: chr [1:99] "ENSXETP00000093342" "ENSXETP00000008372" "ENSXETP00000021288" "ENSXETP00000011719" ...
## $ bD_wD: chr [1:75] "ENSXETP00000097551" "ENSXETP00000041927" "ENSXETP00000008372" "ENSXETP00000021288" ...
Question: Are the gene sets in the list of Xenopus genes the same length as the original DEG list? How and why are they different?
sapply(sig_deg, length)
## bD_bV wD_wV bV_wV bD_wD
## 395 370 98 73
sapply(xtr_deg, length)
## bD_bV wD_wV bV_wV bD_wD
## 434 402 99 75
The gene sets are smaller. This is because not every gene for our organism could be annotated.
There is some discussion about what makes a good background. Ideally, it should be the complete list of genes that could be differentially expressed. But what is this?
Question: A good background is:
- All genes in the Xenopus tropicalis proteome
- All genes that could be annotated in the Pelobates genome
- All genes in the tissue-specific transcriptome
It is not always clear, but in this case, I would argue that the tissue specific-transcriptome is the closest right answer. Even here, there are complications. We treat all skin to be the same tissue, but clearly the dorsal and ventral skin are very different.
We will use the full set of genes that were returned by DESeq2. This set should have filtered out genes that have low counts (i.e. unlikely to be expressed across any of our tissues/conditions).
We can use the same function from earlier to convert our list of Pelobates IDs to Xenopus peptide IDs.
xtr_bg<-extract_xtr(rownames(res$bD_bV))
str(xtr_bg)
## chr [1:16588] "ENSXETP00000036152" "ENSXETP00000094659" ...
We are now ready to go! There are a number of software and R packages that let you perform functional enrichment analysis. Here, we will use [g:Profiler])(https://biit.cs.ut.ee/gprofiler/), because it plays particularly well with R and with Ensembl gene/peptide annotations.
The analysis can be performed with a single command, even if our query is a list of multiple gene sets!
An important thing to remember is that the associated R package gprofiler2 is just an API, and the actual analysis will be performed on the g:Profiler server. This version is continuously updated, to match the updates Because of this, it is important to tell gprofiler which version of Ensembl you would like to use. Our annotations came from version 105, this is currently the default so lets use that version.
# set base url:
set_base_url("http://biit.cs.ut.ee/gprofiler")
# check archived URLs in case you are using older versions of Ensembl here: https://biit.cs.ut.ee/gprofiler/page/archives
# run the analysis
res_ora<-gost(multi_query = FALSE, # returns separate results tables for multiquery
custom_bg = xtr_bg, # our background
query=xtr_deg, # our list of gene sets
organism="xtropicalis", # the organism our annotations belong to
exclude_iea = FALSE, # include GO terms that were electronically assigned
correction_method = "gSCS", # the recommended multiple testing correction.
sources=c("GO:BP","GO:CC","GO:MF", "KEGG","REAC"), # the functional sets we are interested in
evcodes=FALSE, ## evcodes TRUE needed for downstream analysis like enrichment maps in Cytoscape, but this takes longer.
significant= FALSE) # return all terms, not just the significant ones
# the results are stored as a "results" dataframe
head(res_ora$result)
## query significant p_value term_size query_size intersection_size
## 1 bD_bV TRUE 3.488191e-06 1518 368 79
## 2 bD_bV TRUE 1.986167e-04 1436 368 71
## 3 bD_bV TRUE 3.379366e-03 1176 368 58
## 4 bD_bV TRUE 3.746630e-03 1322 368 63
## 5 bD_bV TRUE 3.816698e-03 1069 368 54
## 6 bD_bV TRUE 6.253817e-03 19 368 6
## precision recall term_id source term_name
## 1 0.21467391 0.05204216 GO:0032501 GO:BP multicellular organismal process
## 2 0.19293478 0.04944290 GO:0032502 GO:BP developmental process
## 3 0.15760870 0.04931973 GO:0007275 GO:BP multicellular organism development
## 4 0.17119565 0.04765507 GO:0048856 GO:BP anatomical structure development
## 5 0.14673913 0.05051450 GO:0048731 GO:BP system development
## 6 0.01630435 0.31578947 GO:0048066 GO:BP developmental pigmentation
## effective_domain_size source_order parents
## 1 13707 8813 GO:0008150
## 2 13707 8814 GO:0008150
## 3 13707 3168 GO:0032501, GO:0048856
## 4 13707 15160 GO:0032502
## 5 13707 15045 GO:0007275, GO:0048856
## 6 13707 14535 GO:0043473
Question: What do these resuls look like and how many functionally enriched terms are there for each gene set?:
The results returned tell us a few things:
colnames(res_ora$result)
## [1] "query" "significant" "p_value"
## [4] "term_size" "query_size" "intersection_size"
## [7] "precision" "recall" "term_id"
## [10] "source" "term_name" "effective_domain_size"
## [13] "source_order" "parents"
most importantly:
term_id and assocated term_name which are the functional terms from the sources we requested (GO, KEGG, Reactome).p_value, which is actually the adjusted p-value (misleading!), that the given term is enriched (over represented) in our query (gene set).term_size, query_size and intersect_size which tell you how many genes make up the given term, how many genes were in your query and how many genes from both are overlapping.Question: How many terms have been significantly enriched for each of the comparisons?
We can use a p_value cutoff of 0.05 to see how many terms have been functionally enriched in each term.
res_ora$result %>%
filter(p_value<0.05) %>%
group_by(query) %>%
dplyr::count(query, sort=TRUE)
## # A tibble: 4 × 2
## # Groups: query [4]
## query n
## <chr> <int>
## 1 wD_wV 26
## 2 bD_bV 24
## 3 bV_wV 24
## 4 bD_wD 8
We see that the dorsal-ventral comparisons have the most enriched terms (17 and 14) and the dorsal-dorsal comparison has the fewest (7) This is not surprising given the small number of genes in that last set.
gprofiler also has a few visualization tools as well. For example an interactive Manhattan-style plot:
gostplot(res_ora)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Question: What REACTOME pathway is enriched for all of the comparisons?
Another useful plot to show enrichment results are dot plots. We can easily make our custom dot plot using the gprofiler results tables and ggplot.
res_ora$result %>%
select(query,term_name, p_value, intersection_size, query_size,source) %>%
filter(p_value<0.05) %>%
mutate(GeneRatio=intersection_size/query_size) %>%
arrange(GeneRatio) %>%
mutate(term_name = factor(term_name, levels=unique(term_name))) %>%
ggplot(aes(x=GeneRatio, y=term_name)) +
geom_point(aes(color=p_value, size=intersection_size)) +
ylab("") +
scale_colour_viridis_c(direction = 1, option = "magma") +
facet_grid(source~query,scales = "free_y",space = "free") +
theme_bw()
Question: After having done the differential gene expression analysis and performed a functional enrichment analysis, what have we learned about the pigmentation plasticity of tadpoles?
We have learned for example that: * melanin differential biosynthesis is relevant for all comparisons (even white ventral and black ventral?). * Dorsal-ventral differentiation is much more complicated and involves many. non-pigment related processes such as "system development" and "extracellular matrix".